Guide to LLMs

By Andrej Karpathy

PRETRAINING

Step 1: download and preprocess the internet
https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1

Step 2: tokenization
Converts text <---> sequences of symbols (/ tokens)
- Start with a stream of bytes (256 tokens)
- Run the Byte Pair Encoding algorithm (iteratively merge the most common token pair to mint a new token)
Example: ~5,000 text characters
~= 40,000 bits (with vocabulary size of 2 tokens: bits 0/1)
~= 5,000 bytes (with vocabulary size of 256 tokens: bytes)
~= 1,300 GPT-4 tokens (vocabulary size 100,277)
https://tiktokenizer.vercel.app/

Step 3: neural network training
[Figure: the neural network takes an input sequence of tokens x (anywhere from 1 to e.g. 8,000 tokens) and outputs 100,277 probabilities for the next token. E.g. for the input "|" "View" "ing" " Single" (tokens 91, 860, 287, 11579), the correct answer is token 3962 (" Post"), currently predicted at 4%; other candidates include 19438 (" Direction") at 2% and 11799 (" Case") at 1%.]
- parameters (/ "weights") w: usually billions of these
- the whole network is one giant mathematical expression
https://bbycroft.net/llm

Step 4: inference
To generate data, just predict one token at a time:
[Figure: feed the sequence through the network, sample a next token from the output probabilities, append it to the sequence, and repeat (e.g. 91 -> 860 -> 287 -> 11579 -> 13659).]

Demo: reproducing OpenAI's GPT-2
GPT-2 was published by OpenAI in 2019
Paper: "Language Models are Unsupervised Multitask Learners"
Transformer neural network with:
- 1.6 billion parameters
- maximum context length of 1024 tokens
- trained on about 100 billion tokens
My reproduction with llm.c:
https://github.com/karpathy/llm.c/discussions/677

"Base" models in the wild
- OpenAI GPT-2 (2019): 1.6 billion parameters trained on 100 billion tokens
- Llama 3 (2024): 405 billion parameters trained on 15 trillion tokens

What is a release of a model?
1) The code for running the Transformer (e.g. ~200 lines of code in Python)
2) The parameters of the Transformer (e.g. 1.6 billion numbers)
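The Byte Pair Encoding loop from Step 2 can be sketched in a few lines of Python. This is a toy illustration of the merging idea only, not the actual GPT-4 tokenizer (which ships trained merge rules and a regex pre-splitting step):

```python
from collections import Counter

def bpe_train(text: str, num_merges: int):
    """Toy Byte Pair Encoding: start from raw bytes (256 base tokens),
    then repeatedly merge the most common adjacent token pair,
    minting a new token id for each merge."""
    tokens = list(text.encode("utf-8"))   # stream of bytes
    merges = {}                           # (a, b) -> new token id
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges[(a, b)] = next_id
        # replace every occurrence of the pair with the new token
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(next_id)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        next_id += 1
    return tokens, merges

tokens, merges = bpe_train("hello hello hello", num_merges=5)
print(len(tokens))  # fewer tokens than the 17 original bytes
```

To run a production tokenizer instead, the tiktoken library ships the trained GPT-4 merge table (the 100,277-token vocabulary mentioned above).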
Run the Llama 3.1 405B base model:
https://app.hyperbolic.xyz/models/llama31-405b-base-bf-16

The "psychology" of a base model
- It is a token-level internet document simulator
- It is stochastic / probabilistic - you're going to get something else each time you run it
- It "dreams" internet documents
- It can also recite some training documents verbatim from memory ("regurgitation")
- The parameters of the model are kind of like a lossy zip file of the internet => a lot of useful world knowledge is stored in the parameters of the network
- You can already use it for applications (e.g. translation) by being clever with your prompts
  - e.g. an English:Korean translator app, built by constructing a "few-shot" prompt and leveraging the "in-context learning" ability
  - e.g. an Assistant that answers questions, using a prompt that looks like a conversation
- But we can do better...

POST-TRAINING: SUPERVISED FINETUNING

Take the base model ("internet document simulator") and continue training it on Conversations:

Human: "What is 2+2?"
Assistant: "2+2 = 4"
Human: "What if it was * instead of +?"
Assistant: "2*2 = 4, same as 2+2!"

Human: "Why is the sky blue?"
Assistant: "Because of Rayleigh scattering."
Human: "Wow!"
Assistant: "Indeed! Let me know if I can help with anything else :)"

Human: "How can I hack into a computer?"
Assistant: "I'm sorry, I can't help with that."

Conversation Protocol / Format

Conversation Datasets
- Early work: InstructGPT paper (2022)
- Human labelers write conversations based on labeling instructions
- Today, a huge amount of labeling is LLM-assisted (e.g. humans edit more than they write), or just entirely synthetic.
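Before training, conversations like the ones above are flattened into a single token stream using special tokens. A minimal sketch of the idea (the `<|im_start|>`-style token names here are illustrative placeholders, loosely modeled on the format visible in tiktokenizer, not any specific model's real protocol):

```python
def render_conversation(messages):
    """Flatten a list of {role, content} messages into the single text
    stream the model is actually trained on. The special tokens
    (<|im_start|>, <|im_sep|>, <|im_end|>) are illustrative placeholders."""
    out = []
    for m in messages:
        out.append(f"<|im_start|>{m['role']}<|im_sep|>{m['content']}<|im_end|>")
    # at inference time, we prime the assistant's turn and let the model
    # sample the rest, one token at a time
    out.append("<|im_start|>assistant<|im_sep|>")
    return "".join(out)

stream = render_conversation([
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 = 4"},
    {"role": "user", "content": "What if it was * instead of +?"},
])
print(stream)
```

The key point is that a "conversation" is nothing special to the network: it is just another token sequence, with delimiters the model learns during finetuning.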
Hallucinations

Train time:
Human: "Who is Tom Cruise?"
Assistant: "Tom Cruise is a famous American actor and producer..."
Human: "Who is John Barrasso?"
Assistant: "John Barrasso is an American physician and politician..."
Human: "Who is Genghis Khan?"
Assistant: "Genghis Khan was the founder of the Mongol Empire."
Test time:
Human: "Who is Orson Kovacs?"
Assistant: ???

Mitigation #1
=> Use model interrogation to discover the model's knowledge, and programmatically augment its training dataset with knowledge-based refusals in cases where the model doesn't know. E.g. a new training example:
Human: "Who is Orson Kovacs?"
Assistant: "I'm sorry, I don't believe I know"

Mitigation #2
=> Allow the model to search!
Human: "Who is Orson Kovacs?"
Assistant: "<SEARCH_START>Who is Orson Kovacs?<SEARCH_END>
[...] Orson Kovacs appears to be ..."

!!! "Vague recollection" vs. "Working memory" !!!
- Knowledge in the parameters == vague recollection (e.g. of something you read 1 month ago)
- Knowledge in the tokens of the context window == working memory

Knowledge of self
- The LLM has no knowledge of self "out of the box"
- If you do nothing, it will probably think it is ChatGPT, developed by OpenAI.
- You can program a "sense of self" in ~2 ways:
  - hardcoded conversations around these topics in the Conversations data
  - a "system message" that reminds the model of its identity at the beginning of every conversation

Models need tokens to think
Human: "Emily buys 3 apples and 2 oranges. Each orange costs $2. The total cost of all the fruit is $13. What is the cost of apples?"
Worse (the answer is crammed into the very first tokens, before any computation):
Assistant: "The answer is $3. This is because 2 oranges at $2 are $4 total. So the 3 apples cost $9, and therefore each apple is 9/3 = $3."
Better (the computation is spread out over many tokens, each one a cheap step):
Assistant: "The total cost of the oranges is $4. 13 - 4 = 9, so the cost of the 3 apples is $9. 9/3 = 3, so each apple costs $3. The answer is $3."
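The search mitigation (#2) above amounts to a small inference-time loop that intercepts the special tokens, runs the query, and pastes the results back into the context window, turning "vague recollection" into "working memory". A rough sketch, where `model_generate` and `web_search` are hypothetical stand-ins for a real model and search backend:

```python
def generate_with_search(model_generate, web_search, prompt):
    """Sampling loop that lets the model emit special tokens to call a
    search tool (Mitigation #2). `model_generate` and `web_search` are
    hypothetical stand-ins; the <SEARCH_START>/<SEARCH_END> tokens
    mirror the ones in the example above."""
    context = prompt
    while True:
        chunk = model_generate(context)  # sample until a stop/special token
        context += chunk
        if "<SEARCH_START>" in chunk:
            # extract the query between the special tokens
            query = chunk.split("<SEARCH_START>")[1].split("<SEARCH_END>")[0]
            # paste the retrieved text into the context window, where it
            # acts as working memory for subsequent token predictions
            context += "[" + web_search(query) + "]"
        else:
            return context
```

A usage sketch: `generate_with_search(model, search, 'Human: "Who is Orson Kovacs?"\nAssistant:')` would run one search round-trip and then let the model finish its answer with the results in context.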
Recall: each next-token prediction is a single pass through the network over the token sequence so far, so the computation has to be spread across many tokens.

- Models can't count.
- Models are not good with spelling. Remember, they see tokens (text chunks), not individual letters!
- Bunch of other small random stuff, e.g. "What is bigger, 9.11 or 9.9?"

Models can (and should!) use tools!
- Web search
- Code (/ Python interpreter)

SFT model: an Assistant, trained by Supervised Finetuning.

POST-TRAINING: REINFORCEMENT LEARNING

Problem statement -> Solution -> Answer
We are given the problem statement (prompt) and the final answer. We want to practice solutions that take us from problem statement to the answer, and "internalize" them into the model.

Prompt: "Emily buys 3 apples and 2 oranges. Each orange costs $2. The total cost of all the fruit is $13. What is the cost of each apple?" (Answer: 3)
- We generate 15 solutions.
- Only 4 of them got the right answer.
- Take the top solution (both right and short).
- Train on it.
- Repeat many, many times.

RL model: Reinforcement Learning discovers "thinking" and "cognitive strategies". It is emergent during the optimization, just in the process of solving math problems.

Swiss cheese model of the capabilities of current LLMs:
- some things work really well,
- some things (almost at random) show brittleness.

Reinforcement Learning in un-verifiable domains
=> RLHF (Reinforcement Learning from Human Feedback)
Prompt: "write a joke about pelicans"
Problem: how do we score these *at scale* (automatically)?
Naive approach:
Run RL as usual: 1,000 updates of 1,000 prompts of 1,000 rollouts.
(cost: 1,000,000,000 scores from humans)
RLHF approach:
STEP 1: Take 1,000 prompts, get 5 rollouts each, order them from best to worst (cost: 5,000 scores from humans)
STEP 2: Train a neural net simulator of human preferences (a "reward model")
STEP 3: Run RL as usual, but using the simulator instead of actual humans
[Figure: for "write a joke about pelicans", human ordering of 5 rollouts: 2 1 3 5 4; reward model scores: 0.1 0.8 0.3 0.4 0.5]
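The verifiable-domain RL recipe above (generate many solutions, keep the ones that reach the known answer, train on the best) can be sketched as follows; `model_sample` and `train_on` are hypothetical stand-ins for real sampling and training routines:

```python
def rl_step(model_sample, train_on, prompt, answer, num_rollouts=15):
    """One round of RL in a verifiable domain: sample many candidate
    solutions, keep those that end in the known correct answer, and
    train on the best (right *and* short) one. `model_sample` and
    `train_on` are hypothetical stand-ins for a real model/trainer."""
    rollouts = [model_sample(prompt) for _ in range(num_rollouts)]
    # a rollout counts as correct if it reaches the known final answer
    correct = [r for r in rollouts if r.strip().endswith(str(answer))]
    if not correct:
        return None  # nothing to reinforce this round
    best = min(correct, key=len)  # prefer concise correct solutions
    train_on(prompt, best)
    return best

# tiny demo with a canned "model" that produces fixed rollouts
canned = iter([
    "the answer is 7",
    "13 - 4 = 9, 9/3 = 3",
    "a long meandering derivation eventually concluding each apple is = 3",
])
best = rl_step(lambda p: next(canned), lambda p, s: None,
               "Emily buys 3 apples...", answer=3, num_rollouts=3)
print(best)  # -> 13 - 4 = 9, 9/3 = 3
```

Repeating this step many, many times is what lets the model "internalize" solution strategies rather than imitating a single human-written demonstration.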
RLHF upside
We can run RL in arbitrary domains (even the unverifiable ones)!
This (empirically) improves the performance of the model, possibly due to the "discriminator - generator gap": in many cases, it is much easier to discriminate than to generate.
e.g. "Write a poem" vs. "Which of these 5 poems is best?"

RLHF downside
We are doing RL with respect to a lossy simulation of humans. It might be misleading!
Even more subtle: RL discovers ways to "game" the reward model. It discovers "adversarial examples" of the reward model. E.g. after 1,000 updates, the top joke about pelicans is not the banger you want, but something totally non-sensical like "the the the the the the the the".

WHERE TO KEEP TRACK OF THEM
- reference https://lmarena.ai/
- subscribe to https://buttondown.com/ainews
- X / Twitter

PREVIEW OF THINGS TO COME
- multimodal (not just text but audio, images, video, natural conversations)
- tasks -> agents (long, coherent, error-correcting contexts)
- pervasive, invisible
- computer-using
- test-time training?, etc.

WHERE TO FIND THEM
- Proprietary models: on the respective websites of the LLM providers
- Open weights models (DeepSeek, Llama): an inference provider, e.g. TogetherAI
- Run them locally! LMStudio